Project 2 - 2-Mode Networks - Airlines and Airports

Betsy Rosalen and Mikhail Groysman

Project Overview

  1. Identify a large 2-node network dataset—you can start with a dataset in a repository. Your data should meet the criteria that it consists of ties between and not within two (or more) distinct groups.
  2. Reduce the size of the network using a method such as the island method described in chapter 4 of social network analysis.
  3. What can you infer about each of the distinct groups?

Choosing the Data

We have decide to build on the flight data from OpenFlights.org that we used in our first project.

OpenFlights consists of the following datasets:

  • Airports
  • Airlines
  • Routes
  • Planes
  • Schedules

We have choosen to use the Routes and Airports datasets to build our network. We will use Airline ID and Source Airport ID from the Routes dataset as our two nodes. Each route record will be an edge between the two ntypes of nodes and we will add weighting by the number of routes that each airine has leaving from each particular airport. We will get additional airport node attribute data from the Airports dataset.

Variables used from the Routes dataset:

  • Airline ID - Unique OpenFlights identifier for airline (see Airline).
  • Source airport ID - Unique OpenFlights identifier for source airport (see Airport)

Variables used from the Airports dataset:

  • Name
  • City
  • Country
  • IATA
  • Latitude
  • Longtitude

Data is saved as .DAT files. They are UTF-8 encoded.

Loading the Data

First import necessary packages for plotting graphs using NetworkX and Matplotlib and set up graph size parameters...

In [1]:
import networkx as nx
from networkx.algorithms import bipartite as bi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline

Read in Routes data to a Pandas dataframe. We only need the Airline and Airport columns so we selected for just those two.

In [2]:
routes_raw = pd.read_csv("Project1_Data/routes.dat", 
                       header=None, sep=",", 
                       usecols=[0, 2], 
                       names=['Airline', 'Airport'])

routes_raw=routes_raw.dropna()
In [3]:
routes_raw.head(15)
Out[3]:
Airline Airport
0 2B AER
1 2B ASF
2 2B ASF
3 2B CEK
4 2B CEK
5 2B DME
6 2B DME
7 2B DME
8 2B DME
9 2B EGO
10 2B EGO
11 2B GYD
12 2B KGD
13 2B KZN
14 2B KZN

Add Weights to the Dataframe

We chose to count the number of routes for each eirline at each airport and use that as a weight for our airline to airport edges.

In [4]:
aa = routes_raw.copy()
In [5]:
aa['weight'] = aa.groupby('Airline')['Airport'].transform('count')
In [6]:
aa.head(10)
Out[6]:
Airline Airport weight
0 2B AER 42
1 2B ASF 42
2 2B ASF 42
3 2B CEK 42
4 2B CEK 42
5 2B DME 42
6 2B DME 42
7 2B DME 42
8 2B DME 42
9 2B EGO 42
In [7]:
len(aa)
Out[7]:
67663

Get Additional Airport Data

In [8]:
airports_raw = pd.read_csv("Project1_Data/airports.dat", 
                       header=None, sep=",", 
                       usecols=[1, 2, 3, 4, 6, 7], 
                       names=['Airport Name','City','Country','IATA', 'Latitude', 'Longitude'])
#airports_raw1=airports_raw[airports_raw.Country=='France']
In [9]:
airports_raw.head()
Out[9]:
Airport Name City Country IATA Latitude Longitude
0 Goroka Airport Goroka Papua New Guinea GKA -6.081690 145.391998
1 Madang Airport Madang Papua New Guinea MAG -5.207080 145.789001
2 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU -5.826790 144.296005
3 Nadzab Airport Nadzab Papua New Guinea LAE -6.569803 146.725977
4 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM -9.443380 147.220001
In [10]:
len(airports_raw)
Out[10]:
7698
In [11]:
aa_merged = airports_raw.copy()
In [12]:
aa_merged = pd.merge(aa,aa_merged,left_on='Airport',right_on='IATA',how='inner')
In [13]:
aa_merged.head()
Out[13]:
Airline Airport weight Airport Name City Country IATA Latitude Longitude
0 2B AER 42 Sochi International Airport Sochi Russia AER 43.449902 39.9566
1 7J AER 34 Sochi International Airport Sochi Russia AER 43.449902 39.9566
2 9U AER 38 Sochi International Airport Sochi Russia AER 43.449902 39.9566
3 B2 AER 82 Sochi International Airport Sochi Russia AER 43.449902 39.9566
4 HY AER 151 Sochi International Airport Sochi Russia AER 43.449902 39.9566
In [14]:
airport_nodes = aa_merged[['Airport','Airport Name','City','Country','IATA','Latitude','Longitude']].drop_duplicates().sort_values(by='Airport')

Check for Airline Codes that Match Airport Codes

In [15]:
set(aa_merged['Airline']) & set(aa_merged['Airport']) 
Out[15]:
{'ABJ', 'IOS'}

There are two three letter codes that are used for both airlines and airports, so we have to add differentiation to the names before adding these airport atributes to the graph or all our airlines will also get coded as airports... To keep things short and so we can still read out graph once it's plotted, we added just an L for airLines just to those two airline codes.

In [16]:
#aa_merged['Airline'] = 'L_' + aa_merged.Airline.map(str)
#aa_merged['Airport'] = 'P_' + aa_merged.Airport.map(str)
#airport_nodes['Airport'] = 'P_' + airport_nodes.Airport.map(str)

aa_merged['Airline'] = np.where(
    aa_merged['Airline'].isin(['ABJ', 'IOS']),
    'L_' + aa_merged.Airline.map(str),
    aa_merged.Airline
)
In [17]:
set(aa_merged['Airline']) & set(aa_merged['Airport']) 
Out[17]:
set()
In [18]:
aa_merged.head()
Out[18]:
Airline Airport weight Airport Name City Country IATA Latitude Longitude
0 2B AER 42 Sochi International Airport Sochi Russia AER 43.449902 39.9566
1 7J AER 34 Sochi International Airport Sochi Russia AER 43.449902 39.9566
2 9U AER 38 Sochi International Airport Sochi Russia AER 43.449902 39.9566
3 B2 AER 82 Sochi International Airport Sochi Russia AER 43.449902 39.9566
4 HY AER 151 Sochi International Airport Sochi Russia AER 43.449902 39.9566

Create NetworkX Graph Object

First we will create a full bipartite network with both nodes.

In [19]:
B = nx.Graph()
B.add_nodes_from(aa_merged['Airline'], bipartite=0)
B.add_nodes_from(aa_merged['Airport'], bipartite=1)
B.add_weighted_edges_from([tuple(d) for d in aa_merged[['Airline','Airport','weight']].values])

nx.set_node_attributes(B, airport_nodes.set_index('Airport').to_dict('index'))

Let's double check if our attributes were saved in the graph data.

In [20]:
list(B.nodes(data=True))[:3]
Out[20]:
[('2B', {'bipartite': 0}), ('7J', {'bipartite': 0}), ('9U', {'bipartite': 0})]
In [21]:
list(B.nodes(data=True))[-2:]
Out[21]:
[('WIN',
  {'bipartite': 1,
   'Airport Name': 'Winton Airport',
   'City': 'Winton',
   'Country': 'Australia',
   'IATA': 'WIN',
   'Latitude': -22.36359977722168,
   'Longitude': 143.08599853515625}),
 ('WYA',
  {'bipartite': 1,
   'Airport Name': 'Whyalla Airport',
   'City': 'Whyalla',
   'Country': 'Australia',
   'IATA': 'WYA',
   'Latitude': -33.05889892578125,
   'Longitude': 137.51400756835938})]
In [22]:
B.get_edge_data('L_AER','P_KZN')
In [23]:
print(nx.info(B))
Name: 
Type: Graph
Number of nodes: 3819
Number of edges: 19057
Average degree:   9.9801

Set 'Top' and 'Bottom' Nodes

In [24]:
airline_nodes = {n for n, d in B.nodes(data=True) if d['bipartite']==0}
airport_nodes = set(B) - airline_nodes

Check if the Graph is Connected

In [25]:
nx.is_connected(B)
Out[25]:
False

Check if the Graph is Bipartite

In [26]:
bi.is_bipartite(B)
Out[26]:
True

Density

In [27]:
print(bi.density(B, airline_nodes))
0.010335248855134054
In [28]:
print(bi.density(B, airport_nodes))
0.010335248855134054

Use Island Method on the Bipartite Graph

The functions below will use the edge weights to trim the graph into subgraphs at different minimum weight levels. Any node not connected by at least one ede with the minimum weight will be cut from the graph. We used the textbook functions as a starter, but had to revise them to get them to work...

In [29]:
'''
# textbook function doesn't work...
def trim_edges(g, weight=1):
    g2=nx.Graph()
    for f, to, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(f, to, edata)
    return g2
'''
def trim_edges(g, weight=1): 
    g2=nx.Graph()  
    my_list=[]
    my_list1=[]
    for f, to, edata in g.edges(data=True): 
        if edata['weight'] > weight:
            my_list.append(f)
            my_list1.append(to)
            g2.add_edge(f,to,attr_dict={weight:edata['weight']})
    g2.add_nodes_from(my_list, bipartite=0)
    g2.add_nodes_from(my_list1, bipartite=1)
    return g2

def island_method(g, iterations=5, weight=1):
    weights = [edata['weight'] for f, to, edata in g.edges(data=True)]
    mn=int(min(weights)) if int(min(weights)) > weight else weight
    mx=int(max(weights))
    #compute the size of the step, so we get a reasonable step in iterations
    step=int((mx-mn)/(iterations-1))
    return [[threshold, trim_edges(g, threshold)] for threshold in range(mn,mx,step)]
Run island_method function using 4 iterations and a minimum weight 'water line' of 350...
In [30]:
islands = island_method(B, 4, 350)
print('min weight - ', '# of nodes - ', '# of island subgraphs')
for i in islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0], '     -     ', len(i[1]), '     -     ', len(list(nx.connected_component_subgraphs(i[1]))))
min weight -  # of nodes -  # of island subgraphs
350      -      1733      -      1
1061      -      1060      -      1
1772      -      776      -      1
2483      -      177      -      1
In [31]:
islands[1][0] # the first index refers to the threshold
Out[31]:
1061
In [32]:
islands[1][1] # the second index refers to the graph object
Out[32]:
<networkx.classes.graph.Graph at 0xa1f1b3110>

Graph the Largest 'Island' Subgraph - minimum weight level 350

In [33]:
def set_colors(G):
    colors = []
    for node, data in G.nodes(data=True):
        if data['bipartite'] == 1: 
            colors.append('olivedrab') # Airports in Green    
        else:
            colors.append('cornflowerblue')  # Airlines in Blue
    return colors
In [34]:
G0=max(nx.connected_component_subgraphs(islands[0][1]), key=len)
In [63]:
plt.rcParams["figure.figsize"] = (15,15) # set plot size
colors = set_colors(G0)  # set colors
#weights = [math.log(edata['attr_dict'][200]) for f, t, edata in G0.edges(data=True)] # set weights

nx.draw(G0, with_labels=True, node_color=colors, node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

We can start to see already how many airports are connected by a much smaller number of airlines (as we would expect). We can also see how airlnes serve as boundary spanners between airports.Let's raise the water level though to see if we can gain more insight...

Graph the Second Largest 'Island' Subgraph - minimum weight level 1061

In [36]:
G1=islands[1][1]
In [62]:
plt.rcParams["figure.figsize"] = (15,15) # set plot size
colors = set_colors(G1)  # set colors

nx.draw(G1, with_labels=True, node_color=colors, node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

Major airlines are starting to be visible with the airports that they connect. Many cliques are startig to develop and airlines are starting to show as boundary spanners. A higher water level may make it easier to pick out individual airlines and airports.

Graph the Third Largest 'Island' Subgraph - minimum weight level 1772

In [38]:
G2=islands[2][1]
In [39]:
plt.rcParams["figure.figsize"] = (15,15) # set plot size
colors = set_colors(G2)  # set colors

nx.draw(G2, with_labels=True, node_color=colors, node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

Here we can really start to see the 4 most influential airlines in the network:

  • FR - Air France
  • UA - United Airlines
  • DL - Delta Air Lines
  • US - US Airways
  • AA - American Airlines

If there are any others they are still hidden in the graph. Air France clearly stands out as a boundary spanner in this graph.

Graph the Fourth Largest 'Island' Subgraph - minimum weight level 2483

In [40]:
G3=islands[3][1]
In [41]:
plt.rcParams["figure.figsize"] = (8,8) # set plot size
colors = set_colors(G3)  # set colors

nx.draw(G3, with_labels=True, node_color=colors, node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

In this last graph we can see that FR, Air France, is the most influential airline in the network.

Airport Node Projected Graph

In [42]:
# compute an affiliation network of the Airports
airports = bi.weighted_projected_graph(B, airport_nodes)

# Find the largest connected subgraph in the network
#users_subgraph = nx.connected_component_subgraphs(users)[0]  # textbook code doesn't work
airports_subgraph = max(nx.connected_component_subgraphs(airports), key=len)  # alternate method

airports_subgraph.name = "Airports"

print(nx.info(airports_subgraph))
Name: Airports
Type: Graph
Number of nodes: 3235
Number of edges: 432949
Average degree: 267.6655

Run island_method function using 3 iterations and a minimum weight 'water line' of 25...

In [43]:
airport_islands = island_method(airports_subgraph, 3, 25)
print('min weight - ', '# of nodes - ', '# of island subgraphs')
for i in airport_islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0], '     -     ', len(i[1]), '     -     ', len(list(nx.connected_component_subgraphs(i[1]))))
min weight -  # of nodes -  # of island subgraphs
25      -      118      -      1
49      -      23      -      1
73      -      2      -      1

Graph the Largest Airport Node 'Island' Subgraph - minimum weight level 25

In [44]:
APG0=airport_islands[0][1]
In [45]:
# set plot size
plt.rcParams["figure.figsize"] = (15,15)

nx.draw(APG0, with_labels=True, node_color='olivedrab', node_size=200, 
        font_size=10, font_weight='bold', edge_color="olivedrab", alpha=0.5)

Graph the Second Largest Airport Node 'Island' Subgraph - minimum weight level 49

In [46]:
APG1=airport_islands[1][1]
In [47]:
# set plot size
plt.rcParams["figure.figsize"] = (6,6)

nx.draw(APG1, with_labels=True, node_color='olivedrab', node_size=200, 
        font_size=10, font_weight='bold', edge_color="olivedrab", alpha=0.5)

LHR or Heathrow Airport in London, England stands out as a boundary spanner in this graph.

Graph the Third Largest Airport Node 'Island' Subgraph - minimum weight level 73

In [48]:
APG2=max(nx.connected_component_subgraphs(airport_islands[2][1]), key=len)
In [49]:
# set plot size
plt.rcParams["figure.figsize"] = (2,2)

nx.draw(APG2, with_labels=True, node_color='olivedrab', node_size=200, 
        font_size=10, font_weight='bold', edge_color="olivedrab", alpha=0.5)

Frankfurt am Main Airport in Frankfurt, Germany and Charles de Gaulle Airport (Roissy Airport) in Paris, France are the two most influential airports in our network.

Airline Node Projected Graph

In [50]:
# compute an affiliation network of the Airlines
airlines = bi.weighted_projected_graph(B, airline_nodes)

# Find the largest connected subgraph in the network
#users_subgraph = nx.connected_component_subgraphs(users)[0]  # textbook code doesn't work
airlines_subgraph = max(nx.connected_component_subgraphs(airlines), key=len)  # alternate method

airlines_subgraph.name = "Airlines"

print(nx.info(airlines_subgraph))
Name: Airlines
Type: Graph
Number of nodes: 564
Number of edges: 35119
Average degree: 124.5355

Run island_method function using 3 iterations and a minimum weight 'water line' of 50...

In [51]:
airline_islands=island_method(airlines_subgraph, 3, 50)
print('min weight - ', '# of nodes - ', '# of island subgraphs')
for i in airline_islands:
    # print the threshold level, size of the graph, and number of connected components
    print(i[0], '     -     ', len(i[1]), '     -     ', len(list(nx.connected_component_subgraphs(i[1]))))
min weight -  # of nodes -  # of island subgraphs
50      -      75      -      3
192      -      7      -      1
334      -      2      -      1

Graph the Largest Airline Node 'Island' Subgraph - minimum weight level 50

In [52]:
ALG0=max(nx.connected_component_subgraphs(airline_islands[0][1]), key=len)
In [53]:
# set plot size
plt.rcParams["figure.figsize"] = (15,15)

nx.draw(ALG0, with_labels=True, node_color='cornflowerblue', node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

Graph the Second Largest Airline Node 'Island' Subgraph - minimum weight level 192

In [54]:
ALG1=max(nx.connected_component_subgraphs(airline_islands[1][1]), key=len)
In [55]:
# set plot size
plt.rcParams["figure.figsize"] = (6,6)

nx.draw(ALG1, with_labels=True, node_color='cornflowerblue', node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

The seven most influential airlines in the network in alphabetical order are:

  • AA - American Airlines
  • AF - Air France
  • AZ - Alitalia
  • DL - Delta Air Lines
  • KL - KLM Royal Dutch Airlines
  • UA - United Airlines
  • US - US Airways

Graph the Third Largest Airline Node 'Island' Subgraph - minimum weight level 334

In [56]:
ALG2=max(nx.connected_component_subgraphs(airline_islands[2][1]), key=len)
In [57]:
# set plot size
plt.rcParams["figure.figsize"] = (2,2)

nx.draw(ALG2, with_labels=True, node_color='cornflowerblue', node_size=200, 
        font_size=10, font_weight='bold', edge_color="skyblue", alpha=0.5)

The two most influential airlines in the network are clearly US and AA corresponding to US Airways and American Airlines respectively. Interestingly these do not correspond to the most influential airline in the bipartite network graph above, which was FR for Air France.

References